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Abstract 

Based  on  the  classic  augmented  Lagrangian  multiplier  method,  we  propose,  analyze  and 
test  an  algorithm  for  solving  a  class  of  equality-constrained  non-smooth  optimization  prob¬ 
lems  (chiefly  but  not  necessarily  convex  programs)  with  a  particular  structure.  The  algorithm 
effectively  combines  an  alternating  direction  technique  with  a  nonmonotone  line  search  to  min¬ 
imize  the  augmented  Lagrangian  function  at  each  iteration.  We  establish  convergence  for  this 
algorithm,  and  apply  it  to  solving  problems  in  image  reconstruction  with  total  variation  reg¬ 
ularization.  We  present  numerical  results  showing  that  the  resulting  solver,  called  TVAL3,  is 
competitive  with,  and  often  outperforms,  other  state-of-the-art  solvers  in  the  field. 


1  Introduction 

1.1  A  Class  of  Non-Smooth  Minimization  Problems 

In  this  paper,  we  consider  solving  a  class  of  equality-constrained  minimization  problems  of  the  form 

min  f(x,y),  s.t.  h(x,  y)  =  0,  (1) 

x,y 

where  x  E  Mni,  y  E  HU12,  the  vector- valued  function  h  :  Mni+n2  (to  <  m+n2)  is  differentiable 

with  respect  to  both  x  and  y,  but  the  function  /  may  or  may  not  be  differentiable  with  respect 
to  y.  In  addition,  we  will  later  impose  a  special  structure  on  such  problems  under  consideration. 
For  solving  this  class  of  problems,  we  will  propose  and  study  an  algorithm  in  the  framework  of  the 
classic  augmented  Lagrangian  multiplier  (ALM)  method,  first  suggested  by  Hestenes  [20]  and  Powell 
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[28].  In  the  ALM  framework,  one  obtains  the  k- th  iterate  ( xk,yk )  by  minimizing  the  augmented 
Lagrangian  function 

CA(x,  y,  A;  y)  =  f(x,  y)  -  A Th{x,  y)  +  y)Th{x,  y),  y>0,  (2) 

jointly  with  respect  to  both  x  and  y  for  a  given  multiplier  estimate  A  =  \k-i-,  then  updates  the 
multiplier  estimate  by  the  formula  A*,  =  Xk-i  —  yh(xk,yk)-  In  principle,  the  positive  parameter  y 
in  the  augmented  Lagrangian  function,  known  as  the  penalty  parameter,  can  also  be  altered  from 
iteration  to  iteration. 

It  is  evident  that  the  iteration-complexity  of  an  ALM  algorithm  depends  almost  entirely  on  how 
the  augmented  Lagrangian  function  is  minimized  jointly  with  respect  to  both  x  and  y.  In  order  to 
solve  such  subproblems  efficiently,  one  should  utilize  useful  structures  existing  in  the  augmented 
Lagrangian  function.  Therefore,  we  concentrate  on  solving  unconstrained  minimization  problems 
of  the  form 

min  4>(x,y),  (3) 

x,y 

where  cf  is  differentiable  with  respect  to  the  block  variable  x  but  not  necessarily  to  y. 

In  this  paper,  we  assume  that  the  objective  function  (f>(x,  y)  in  (3)  has  the  following  qualitative 
structure,  called  a  “ structure  of  uneven  complexity that  is,  in  some  measurement, 

the  complexity  of  minimizing  cf>{x,  y)  with  respect  to  y  is 
much  lower  than  that  of  minimizing  with  respect  to  x. 

For  example,  for  i,j/£  Mn  a  function  (j>(x,y)  has  a  structure  of  uneven  complexity  if  the  complexity 
of  minimizing  <t>(x,  y)  with  respect  to  y  is  O(n)  while  that  of  minimizing  with  respect  to  x  is  0(n 2) 
or  higher. 

With  little  loss  of  generality,  we  assume  that 

y(x)  =  argmin  cj>(x,y)  (4) 

y 

exists  and  is  unique  for  each  x  in  a  region  of  interest.  Consequently,  problem  (3)  can  be  reduced 
to  an  unconstrained  minimization  problem  with  respect  to  x  only;  that  is, 

min  if(x)  =  c/)(x,y(x)),  (5) 

X 

where  if(x)  is  generally  non-differentiable,  but  d\ <f>,  the  partial  derivative  of  (j)(x,  y)  with  respect  to 
x,  is  assumed  to  exist. 

To  solve  the  unconstrained  minimization  problem  (5),  we  will  construct  a  nonmonotone  line 
search  algorithm  which  is  a  modification  of  the  one  [37]  proposed  by  Zhang  and  Hager.  The 
modification  is  necessary  since,  to  the  best  of  our  knowledge,  existing  nonmonotone  line  search 
algorithms  require  that  objective  functions  be  differentiable,  or  at  least  have  their  sub-differentials 
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available.  In  our  case,  however,  we  do  not  require  the  availability  of  the  sub-differential  of  x ). 

Instead,  we  only  use  d\ <f>  —  the  partial  derivative  of  (f>(x ,  y )  with  respect  to  x. 

Problem  (1)  has  a  wide  range  of  applications.  For  example,  a  large  number  of  problems  in 
physics,  mechanics,  economics  and  mathematics  can  be  reduced  to  variational  problems  of  the 
form: 

minf(x)  =  fi{x)  +  f2(Bx ), 

X 

where  both  f\  and  f2  are  convex,  proper,  lower  semicontinuous  functions,  and  B  is  a  linear  operator. 
We  consider  the  case  that  f\  is  differentiable  but  f2  is  not.  In  the  early  1980s,  Glowinski  et  al. 
studied  this  type  of  problems  in  depth  using  the  ALM  and  operator-splitting  methods  [14,  16], 
which  also  have  close  ties  to  earlier  works  such  as  [24] .  Clearly,  the  above  unconstrained  variational 
problem  is  equivalent  to 

min  fi(x)  +  f2(y),  s.t.  Bx  -  y  =  0.  (6) 

X 

It  is  evident  that  problem  (6)  is  a  special  case  of  problem  (1).  As  we  will  see  in  a  concrete  example 
below,  whenever  f2  is  a  separable  function,  minimizing  the  augmented  Lagrangian  function  of  (6) 
with  respect  y  is  likely  to  be  trivial. 

1.2  An  Example:  Total  Variation  Minimization  for  Compressive  Sensing 

In  recent  years,  a  new  theory  of  compressive  sensing  (CS)  [11,  7,  6]  —  also  known  under  the 
terminology  of  compressed  sensing  or  compressive  sampling  —  has  drawn  considerable  research 
attention.  It  provides  an  alternative  approach  to  data  acquisition  and  compression  that  reduces 
the  number  of  required  samples,  which  could  translate  into  simpler  sensors,  short  sensing  time, 
and/or  reduced  transmission/storage  costs  in  suitable  applications.  The  theory  indicates  that  a 
sparse  signal  under  some  basis  may  still  be  recovered  even  though  the  number  of  measurements  is 
deemed  insufficient  by  Shannon’s  criterion.  Nowadays,  CS  has  been  widely  studied  and  applied  to 
various  fields  (see  [22,  10,  21,  13,  38,  39]  for  example). 

Given  measurements  b,  instead  of  finding  the  sparsest  solution  to  Ax  =  b  by  a  combinatorial 
algorithm,  which  is  generally  NP-hard  [25],  one  often  chooses  to  minimize  the  G  -norm  or  the 
total  variation  (abbreviated  TV)  of  x.  In  the  context  of  CS,  sufficient  conditions  for  exact  and 
stable  recoveries  are  given  in  [12]  and  [6].  The  use  of  TV  regularization  instead  of  the  t\  term 
makes  the  reconstructed  images  sharper  by  preserving  the  edges  or  boundaries  more  accurately. 
Instead  of  assuming  the  signal  is  sparse,  the  premise  of  TV  regularization  is  that  the  gradient  of  the 
underlying  signal  or  image  is  sparse.  In  spite  of  those  remarkable  advantages  of  TV  regularization, 
the  properties  of  non-differentiability  and  non-linearity  make  TV  minimization  computationally 
more  difficult  than  i\  minimization. 

The  use  of  TV  has  a  long  and  rich  history.  It  was  introduced  into  imaging  denoising  problems 
by  Rudin,  Osher  and  Fatemi  in  1992  [30].  From  then  on,  TV  minimizing  models  have  become  one 
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of  the  most  popular  and  successful  methodologies  for  image  denoising  [30,  8],  deconvolution  [9,  33] 
and  restoration  [5,  31,  34,  35],  to  cite  just  a  few. 

Specifically,  the  noise-free  discrete  TV  model  for  CS  reconstruction  can  be  written  as 

minTV(x)  =  ||H.;x|L,  s.t.  Ax  =  b,  (7) 

i 

where  x  G  Mn,  or  x  G  with  s  -  t  =  n,  represents  an  image,  DiX  G  M2  is  the  discrete  gradient  of 
x  at  pixel  i,  A  G  Mmxn  ( m  <  is  the  measurement  matrix,  b  G  Rm  is  the  measurement  of  x,  and 
p  =  1  or  2.  The  £p-norm  could  either  be  the  l^-norm  corresponding  to  the  isotropic  TV,  or  the  i\- 
norm  corresponding  to  the  anisotropic  TV.  For  reconstruction  from  noise-corrupted  measurements, 
one  can  solve  the  ROF  model  instead 

minTV(x)  +  ^||Ar  —  b\\2. 

x  2 

For  convenience,  ||.||  refers  to  the  I2  norm  hereafter. 

To  separate  the  non-differentiable  7p-riorm  term,  we  can  split  variables  by  introducing  y i 
Then  models  (7)  and  (8)  are  equivalent  to,  respectively, 

min  >  \\yi\\v,  s.t.  Ax  =  b  and  DiX  =  y*  Vi,  (9) 

yi,x 

l 

and 

mm'Yl  WyiWp  +  ^\\Ax  -  b\\2 ,  s.t.  Dtx  =  yt  Vi.  (10) 

Vi  Z 

l 

Both  models  (9)  and  (10)  can  be  regarded  as  special  forms  of  (1),  while  the  non-differentiable  parts 
of  their  augmented  Lagrangian  functions  are  easy  to  solve  due  to  separability. 

1.3  A  Brief  Overview  of  Related  Works 

For  the  method  proposed  in  this  paper,  the  ALM  method  plays  a  key  role,  which  has  been  well 
researched  for  decades.  The  general  quadratic  penalty  method  turns  a  constrained  optimization 
problem  into  a  series  of  unconstrained  problems  by  penalizing  constraint  violations.  However,  in 
theory  it  requires  the  penalty  parameter  to  go  to  infinity  to  guarantee  convergence,  which  may  cause 
a  deterioration  in  the  numerical  conditioning  of  the  method.  In  1969,  Hestenes  [20]  and  Powell 
[28]  independently  proposed  the  ALM  method  which,  by  introducing  and  adjusting  Lagrangian 
multiplier  estimates,  no  longer  requires  the  penalty  parameter  to  go  to  infinity  for  the  method 
to  converge.  The  augmented  Lagrangian  function  differs  from  the  standard  Lagrangian  function 
with  an  additional  square  penalty  term,  and  differs  from  the  quadratic  penalty  function  with  an 
additional  term  involving  the  multiplier  A  times  the  constraints. 
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Numerically,  it  is  usually  impossible  to  find  an  exact  minimizer  of  unconstrained  minimization 
subproblem  (2)  at  each  iteration.  For  convex  optimization,  Rockafellar  [29]  proved  the  global  con¬ 
vergence  of  the  ALM  method  for  any  positive  penalty  parameter  value,  as  long  as  the  subproblems 
are  solved  to  prescribed  tolerances  of  increasing  accuracy.  In  addition,  the  convergence  theorem 
holds  without  assuming  the  objective  function  /  to  be  differentiable. 

Extending  the  classic  ALM  method,  Glowinski  et  al.  [17,  15]  also  suggested  the  so-called  alter¬ 
nating  direction  method  (often  abbreviated  as  ADM).  Instead  of  jointly  minimizing  the  augmented 
Lagrangian  function  (2)  at  each  iteration,  ADM  only  demands  minimizers  with  respect  to  x  and  y 
respectively  before  updating  the  multiplier,  which  may  produce  computationally  more  affordable 
iterations. 

ADM  is  most  effective  when  both  subproblems  can  be  efficiently  and  accurately  solved,  which 
is  not  always  possible.  For  example,  in  TV  minimization  model  (9)  or  (10),  one  of  the  subproblems 
is  usually  quadratic  minimization  that  dominates  the  computation.  Without  further  special  struc¬ 
tures,  accurately  solving  such  a  quadratic  minimization  problem  at  each  iteration  can  be  excessively 
expensive. 

Recently,  it  has  been  discovered  that  ALM  can  also  be  derived  through  an  alternative  approach 
called  Bregman  regularization  [27,  36].  In  particular,  Goldstein  and  Osher  [18]  applied  Bregman 
regularization  to  a  split  formulation  in  [33]  to  derive  an  algorithm  for  TV  minimization  called  split 
Bregman,  which  is  equivalent  to  ALM.  In  their  computational  experiments,  however,  they  just  used 
one  sweep  of  alternating  direction  iteration  to  approximately  minimize  the  augmented  Lagrangian, 
resulting  in  a  numerically  efficient  implementation  that  turns  out  to  be  equivalent  to  ADM. 

Several  solvers  have  been  developed  to  solve  TV  minimization  problem  (7)  or  (8),  or  other 
variants.  Among  them,  £i-Magic  [6,  7],  TwIST  [3,  4]  and  NESTA  [2]  have  been  widely  used. 
Specifically,  £i-Magic  solves  a  second-order  cone  reformulation  of  TV  models.  TwIST  implements 
a  two-step  iterative  shrinkage/thresholding  (1ST)  algorithms,  which  exhibits  much  faster  conver¬ 
gence  rate  than  1ST  itself  when  the  linear  observation  operator  is  ill-conditioned.  NESTA  is  based 
on  Nesterov’s  smoothing  technique  [26],  extended  to  TV  minimization  by  modifying  the  smooth 
approximation  of  the  objective  function.  In  Section  4,  we  will  apply  our  proposed  method  to  TV 
minimization  with  comparison  to  the  above  three  state-of-the-art  solvers. 

1.4  Contributions 

The  classic  ALM  method  is  a  fundamental  and  effective  approach  in  constrained  optimization. 
However,  to  apply  it  to  realistic  applications,  it  is  critically  important  to  design  subproblem  solvers 
capable  of  taking  advantages  of  useful  problem  structures.  In  this  work,  we  consider  a  rather 
common  structure  that  in  minimizing  a  non-smooth  augmented  Lagrangian  function  of  two  block 
variables,  it  is  much  easier  to  minimize  with  respect  to  one  of  the  variables  (in  which  the  function 
may  not  be  differentiable)  than  with  respect  to  the  other.  This  structure  of  uneven  complexity 
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exists  in  a  wide  range  of  application  problems. 

We  construct  an  efficient  method  for  problems  with  the  uneven  structure  that  integrates  two 
existing  algorithmic  ideas:  (a)  alternating  direction  and  (b)  nonmonotone  line  search.  The  former 
enables  taking  full  advantages  of  the  low-cost  minimization  in  the  “easy”  direction;  and  the  latter 
allows  relatively  quick  and  large  steps  in  the  “hard”  direction.  We  are  able  to  establish  convergence 
for  this  algorithm  by  extending  existing  theoretical  results. 

Our  numerical  results  on  image  reconstruction  with  TV  regularization  show  that  the  proposed 
algorithm  is  robust  and  efficient,  significantly  outperforming  several  state-of-the-art  solvers  on  most 
tested  problems.  The  resulting  MATLAB  solver,  called  TVAL3,  has  been  posted  online  [23]. 

2  Algorithm  Construction 

In  this  section,  we  first  describe  a  first-order  algorithm  for  solving  the  non-smooth,  unconstrained 
minimization  problem  (5).  The  algorithm  is  an  extension  to  the  one  in  [37]  designed  for  minimiz¬ 
ing  smooth  functions.  This  non-smooth  unconstrained  minimization  algorithm  is  then  embedded 
into  the  classic  ALM  framework  to  form  the  backbone  of  an  algorithm  for  solving  the  equality- 
constrained  optimization  problem  (1).  The  motivation  for  the  proposed  algorithms  is  to  take  full 
advantages  of  the  structure  of  uneven  complexity ,  explained  above,  so  that  the  derived  algorithm 
has  a  relatively  low  iteration-complexity. 

2.1  Nonmonotone  Line  Search  for  Smooth  Functions 

Grippo,  Lampariello  and  Lucidi  [19]  proposed  a  nonmonotone  line  search  scheme  in  1986.  In  stead 
of  requiring  a  monotone  decrease  in  the  objective  function  value  as  in  the  classic  line  search  schemes, 
it  only  enforces  a  decrease  in  the  maximum  of  previous  k  function  values.  More  recently,  Zhang 
and  Hager  [37]  modified  their  line  search  scheme  by  replacing  the  “maximum”  by  a  “weighted 
average”  of  all  the  previous  function  values,  and  showed  that  their  scheme  required  fewer  function 
and  gradient  evaluations  on  a  large  set  of  test  problems.  Convergence  results  for  both  schemes  were 
established  under  the  assumption  that  the  objective  function  is  differentiable. 

In  the  nonmonotone  line  search  algorithm  (NLSA)  given  in  [37],  at  each  iteration  the  step  length 
Ofc  is  chosen  to  be  uniformly  bounded  above  and  to  satisfy  the  nonmonotone  Armijo  condition: 

ip(xk  +  adk)  <Ck  +  a5Shffxk)T dk ,  (11) 

where  d k  is  a  descent  direction,  and  Ck  is  a  linear  combination  of  all  the  previous  function  values, 
updated  by  the  formulas 

Qk+l  —  VkQk  T  1;  Cfi- 1-1  —  {jlkQkCk  T  /(^fc+i ) )  / Qk+l  i  (12) 

where  %  >  0  controls  the  degree  of  nonmonotonicity.  Specifically,  the  larger  %  is,  the  more 
nonmonotone  the  scheme  is  allowed  to  be.  Additionally,  one  may  also  require  that  the  Wolfe 
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conditions : 


y^{xk  +  adk)Tdk  >  aVi/j(xk)Tdk,  (13) 

be  satisfied  as  well.  Under  these  settings,  they  proved  global  convergence  for  smooth  functions, 
and  R-linear  convergence  rate  for  strongly  convex  functions. 

In  Section  3,  we  will  extend  the  convergence  proof  in  [37]  to  the  case  of  minimizing  the  non- 
differentiable  function  ip(x)  defined  in  (5). 

2.2  Algorithm  NADA 

We  now  describe  a  nonmonotone  line  search  algorithm  for  solving  the  non-smooth  unconstrained 
minimization  problem  (5),  which  is  equivalent  to  (3).  From  the  standpoint  of  solving  (3),  the 
algorithm  has  a  flavor  of  alternating  minimization  or  block  coordinate  descent,  but  it  does  not 
require  a  monotone  descent.  For  convenience,  we  call  this  approach  Nonmonotone  Alternating 
Direction  Algorithm ,  or  simply  NADA. 

Since  our  objective  function  =  <f>(x,y(x))  is  non-differentiable,  the  nonmonotone  line 

search  algorithm  described  in  [37]  is  not  directly  applicable.  The  main  modification  is  to  replace 
the  search  direction  dk  =  —Vi/j(xk),  which  does  not  exist  in  our  case,  by  dk  =  —d\ (j>{xk,y(xk))- 
Such  a  modification  can  be  justified  as  follows.  Suppose  that  all  the  involved  subdifferentials  exist, 
then  it  follows  from  the  chain  rule  that 

d'ifj(x)  =  di  <p(x,y(x))  +  d24>(x,y(x))dy(x). 

By  the  construction  of  y(x),  we  have  0  G  <92</>(x,  y(x)),  then  d± cj>(x,  y(x))  G  dip(x).  Hence,  the  search 
direction  dk  =  —d\ 4>(xk,  y(x k))  can  be  regarded  as  a  subgradient  direction  for  i/)(x).  However,  since 
we  do  not  require  that  y(x)  be  sub-differentiable,  a  vigorous  convergence  proof  is  still  necessary  for 
such  a  modification. 

To  suite  our  situation  we  need  to  modify  the  nonmonotone  Armijo  condition  into  the  following 
form: 


4>{xk  +  ctdkiVk)  <Ck  +  ad di <j>(xk,  yk)T dk,  (14) 

which  is  just  (11)  applied  to  the  function  yk)  as  a  function  of  x  at  every  iteration.  In  our  imple¬ 
mentation,  we  always  choose  the  search  direction  dk  =  —d\ (j)(xk,yk),  even  though  the  convergence 
theorem  in  the  next  section  actually  allows  a  wider  range  of  choices. 

Algorithm  1  (Nonmonotone  Alternating  Direction  Algorithm). 

Initialize  0<5<l<p,0<  ym\ n  <  r?max  <  1,  amM  >  0,  tol  >  0,  and  Qo  =  1. 

Choose  starting  points  (xo,yo),  and  set  Cq  =  4>(xo,yo)  and  k  =  0. 

While  \\di(f>(xk,yk)\\  >  tol  Do 
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(1)  Compute  dk  =  —di4>(xk,yk),  and  an  initial  trial  step  ak  >  0. 

(2)  Choose  ak  =  akp~6k  where  Ok  is  the  largest  integer  such  that 
oik  <  a  max  and  the  nonmonotone  Armijo  condition  (14)  holds. 

(3)  Set  xk+i  =  xk  +  akdk. 

(4)  For  some  r]k  6  [r/min,  f?max],  compute  Qk+ 1  and  Ck+i  by  (12). 

(5)  Compute  yk+ 1  =  y{x k+i)  =  argmmy<f>(xk+1,y). 

(6)  Increment  k  and  continue. 

End  Do 


Some  additional  remarks  are  in  order. 


Remark  1.  In  our  implementation,  the  initial  trial  step  ak  is  chosen  according  to  the  Barzilai- 
Borwein  method  [1]  to  achieve  good  practical  performance.  For  given  yk,  applying  BB  method  on 
minimizing  4>{x,  yk)  with  respect  to  x  leads  to  a  trial  step 


ak 


slsk 

slzk 


(15) 


or  alternatively  ak  =  s\zkjz{zk,  where  sk  =  xk  -  xk-i  and  zk  =  di <t>(xk,yk)  -  di<f>(xk-i,yk). 


Remark  2.  The  integer  parameter  6k  is  not  necessarily  positive.  In  practical  implementations, 
starting  from  the  BB  step,  one  can  increase  or  decrease  the  step  length  by  forward  or  backward 
tracking  until  the  nonmonotone  Armijo  condition  is  satisfied. 


Compared  to  popular  existing  approaches,  NADA  differs  from  the  traditional  alternating  min¬ 
imization  approach  since  it  does  not  require  exact  (or  high-precision)  minimizers  in  all  directions, 
and  it  differs  from  the  block  coordinate  descent  (BCD)  approach  since  it  does  not  require  a  mono¬ 
tone  decease  in  the  objective  function.  The  convergence  of  Algorithm  NADA  will  be  analyzed  in 
Section  3. 


2.3  TVAL3 

Embedding  the  unconstrained  minimization  algorithm  NADA  into  the  ALM  framework,  we  obtain 
the  following  algorithm  for  solving  the  equality-constrained  minimization  problem  (1). 

Algorithm  2  (Augmented  Lagrangian  Multiplier). 

Initialize  (xq,  yo),  Po  >  0,  and  Xq  =  0  G  Mm.  Set  k  =  0. 

While  “not  converged'’ ,  Do 

(1)  Call  NADA  to  minimize  cj)(x,y)  =  CA{x,y,  Xk;  yk) 
starting  from  (xk,yk),  giving  the  output  (xk+1,yk+i). 

(2)  Update  the  multiplier:  Xk+i  =  Xk  —  p,kh{xk+i,yk+i). 

(3)  If  necessary,  update  the  penalty  parameter  to  get  fak+i. 


(4)  Increment  k  and  continue. 

End  Do 

In  order  to  achieve  low-cost  minimization  with  respect  to  y  (the  non-smooth  part),  a  variable¬ 
splitting  technique  is  usually  coupled  with  this  algorithm.  The  idea  of  variable-splitting  has  been 
used  in  various  fields  for  years,  and  has  recently  been  introduced  into  image  deconvolution  and 
TV  minimization  in  [33].  For  instance,  we  could  split  the  ti-norm  term  from  the  finite  difference 
operation  as  illustrated  in  Section  1.2. 

Specifically,  we  apply  NADA  to  the  variants  of  TV  regularized  linear  inverse  problems  (9) 
and  (10)  presented  in  Section  1.2,  resulting  in  a  solver  that  we  call  TVAL3  [23]  (Total  Variation 
Augmented  Lagrangian  ALternating-direction  ALgorithm).  In  the  particular  case  of  solving  (9), 
we  have 

4>(x,y)  =  CA(x,y, 

=  (\\yi\\v  ~  (Dix  ~  Vi)  +  \  II  Aa;- 2/if')  +  |||Ax-6-  Xfn\\2. 

i  '  ' 

Then  we  can  easily  derive 

di<t>(x,y)  =  Y](pPf(DjX  -  y)  -  Djui )  +  yAT  (Ax  -b  -  A///),  (16) 

i 

and,  when  p  =  2  (isotropic  TV), 

l/i(*)  =  argmin  <f>(x,y)  =  max/ HA*  -  v*/P  ||2  “  4>0}  TT  »  (1T) 

Vi  l  P  )  II  P>iX  Vi/PiW 

which  is  the  so-called  2D  shrinkage  formula.  When  p  =  1,  one  would  apply  an  equally  simple  ID 
shrinkage  formula  to  obtain  yt(x).  As  we  can  see,  the  computation  of  y{x)  is  indeed  very  low  in 
comparison  to  that  of  updating  x  from  solving  (16)  at  a  fixed  y  value.  Hence,  the  structure  of 
uneven  complexity  is  present. 

In  our  implementation  of  NADA  in  TVAL3,  dk  is  chosen  as  the  negative  partial  gradient  given 
in  (16),  and  y(x)  is  computed  using  (17).  A  similar  derivation  is  applicable  to  solving  problem  (10). 

3  Convergence  Results 

As  discussed  earlier,  the  convergence  of  ALM  has  been  thoroughly  studied,  so  the  convergence  of  the 
proposed  Algorithm  2  relies  on  the  convergence  of  Algorithm  1  (i.e. ,  NADA)  for  the  unconstrained 
subproblems.  By  extending  Zhang  and  Hager’s  proof  in  [37],  we  present  a  convergence  result  for 
Algorithm-NADA  in  this  section  (its  proof  is  given  in  the  appendix).  The  fundamental  extension 
is  to  allow  the  objective  function,  previously  assumed  to  be  continuously  differentiable,  to  take 
the  form  f(x,y(x))  where  we  do  not  assume  any  differentiability  of  y{x).  As  such,  NADA  is  no 
longer  a  standard  gradient  or  subgradient  method  previously  studied  in  nonmonotone  line  search 
frameworks 
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3.1  Assumptions  and  Main  Result 


First  recall  the  definition  that  if(x)  =  <f{x,y{x))  for  y(x)  =  argrnin  (j>(x,y).  We  need  to  impose 
the  following  assumption  on  the  function  c f>(x,y). 

Assumption  1.  The  function  <f{x,y)  is  continuously  differentiable  with  respect  to  x,  and  sub- 
differentiable  with  respect  to  y.  Furthermore,  y(x)  =  argmin^  fi(x,  y)  uniquely  exists  for  each  x. 

We  note  that  the  above  assumption  does  not  imply  the  sub-differentiability  of  if(x)  =  < f>(x,  y(x)) 
since  y(x)  is  not  assumed  to  be  sub-differentiable.  In  addition,  fi(x,  y)  does  not  have  to  be  convex 
with  respect  to  y  for  y(x)  to  uniquely  exist. 

Let  d\(j)  and  d2(f  refer  to  the  sub-differentials  of  cf  with  respect  to  x  and  y,  respectively.  The 
convergence  proof  of  NAD  A  makes  use  of  the  following  assumptions. 

Assumption  2  (Direction  Assumption).  There  exist  ci  >  0  and  C2  >  0  such  that 

\  di <f(xk,yk)Tdk  <  —ci\\di<j)(xk,  Vk)\\2, 

l  ||4||  <  c2\\dl<f)(xk,yk)\\. 

This  assumption  obviously  holds  for  dk  =  —d\ (pk(xk)  with  c\  =  C2  =  1.  However,  this  assump¬ 
tion  allows  more  generality.  For  instance,  certain  approximations  to  —d\(j)k{xk)  would  become 
permissible. 

Assumption  3  (Lipschitz  Condition).  There  exists  L  >  0,  such  that  at  any  given  y,  and  for  all  x 
and  x, 


||<9i< p(x,y)  -  dicj>(x,y)\\  <  L\\x  -  x\\.  (19) 

Assumption  4  (Boundedness  from  below).  The  function  cj>(x,y)  is  bounded  below,  i.e., 

f>{x,y)  >  -M 


for  some  M  >  0  and  for  all  (x,y). 

We  note  that  Lipschitz  continuity  and  the  boundedness  from  below  are  widely  assumed  in  the 
analysis  of  convergence  of  gradient-type  methods. 

The  main  convergence  result  of  this  paper  is  as  follows: 


Theorem  1.  Under  Assumptions  1-4,  the  iterate  sequence  {(.Tfc,yfc)}  generated  by  Algorithm- 
NADA,  which  is  well  defined,  satisfies 


lirn  d\<j>(xk,yk)  =  0, 
o 

4>(xk,yk)  3  0. 


(20) 
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Our  proof  of  this  theorem  follows  a  similar  path  as  the  convergence  proof  in  [37],  with  extra 
steps  to  connect  the  non-differentiable  part  with  the  differentiable  part  by  means  of  alternating 
minimization.  We  include  the  detailed  proof  in  the  appendix. 

Based  on  Theorem  1,  we  can  easily  deduce  global  convergence  of  Algorithm-NADA  under  a 
further  convexity  assumption.  We  state  the  following  corollary  without  a  proof. 

Corollary  1.  If(j)(x,y )  is  jointly  convex,  lower  semi- continuous  and  coercive,  then  under  Assump¬ 
tions  1-4  the  sequence  {(xk,yk)}  generated  by  Algorithm-NADA  converges  to  a  minimizer  ( x*,y *) 
of  problem  (3). 

As  indicated  before,  the  convergence  of  the  overall  ALM  algorithm,  Algorithm  2,  follows  from 
that  of  NAD  A. 

4  Numerical  Results  in  Image  Reconstruction 

To  demonstrate  the  performance  of  the  proposed  method  on  problems  with  the  structure  of  un¬ 
even  complexity,  we  conducted  numerical  experiments  on  TV  minimization  problems  from  image 
reconstruction  in  CS;  that  is,  solving  model  (9)  or  (10).  Our  implementation  of  Algorithm  2  is 
the  solver  TVAL3,  which  was  compared  to  three  other  state-of-the-art  TV  minimization  solvers: 
^i-Magic  [6,  7],  TwIST  [3,  4]  and  NESTA  [2].  All  experiments  were  performed  on  a  Lenovo  X301 
laptop  with  a  1.4GHz  Intel  Core  2  Duo  SU9400  and  2GB  of  DDR3  memory,  running  Windows  XP 
and  MATLAB  R2009a  (32-bit). 

Throughout  the  experiments,  we  always  used  a  default  set  of  parameter  values  for  TVAL3. 
Specifically,  we  set  rj  =  .9995,  p  =  5/3,  5  =  10-5  and  amax  =  104  (see  Algorithm  1),  and  initialized 
multiplier  estimate  to  the  zero  vector  as  presented  in  Algorithm  2.  Additionally,  the  penalty 
parameter  might  vary  in  a  range  of  25  to  29  according  to  distinct  noise  level  and  required  accuracy. 
In  spite  of  a  lack  of  theoretical  guidance,  we  have  found  that  it  is  not  particularly  difficult  to  choose 
adequate  values  for  penalty  parameters  since  the  algorithm  is  not  overly  sensitive  to  such  values  as 
long  as  they  fall  into  some  appropriate  but  reasonably  wide  range.  A  few  trial- and-error  attempts 
are  usually  needed  to  find  good  penalty  parameter  values,  judged  by  observed  convergence  speed. 

In  an  effort  to  make  comparisons  as  fair  as  possible,  for  each  of  the  aforementioned  solvers  we 
tried  to  tune  parameters  provided  in  the  interface  of  the  solver  in  order  to  obtain  as  good  a  per¬ 
formance  as  we  possibly  could.  Nevertheless,  the  possibility  still  exists  that  a  solver’s  performance 
could  be  further  improved  by  making  changes  that  we  are  unaware  of.  Therefore,  the  performance 
data  presented  in  this  section  should  be  considered  only  within  the  context  of  our  tests,  but  not  be 
regarded  as  general  judgements. 
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4.1  Tests  on  Synthetic  Data 


The  first  three  tests  are  base  on  synthetic  data.  In  the  first  test,  the  data  x  is  an  n  =  64  x  64 
phantom  image  from  which  an  observation  b  =  Ax  is  generated  without  additive  noise.  The  matrix 
A  e  Mmxn,  where  m  =  0.3n,  is  generated  from  orthogonalizing  the  rows  of  a  Gaussian  random 
matrix  by  QR  factorization.  Then  we  apply  TVAL3,  TwIST,  NESTA  and  Id-Magic  to  model  (7) 
and  obtain  a  recovered  image  from  each  solver.  The  quality  of  recovered  images  is  measured  by  the 
signal-to-noise  ratio  (SNR).  Parameters  were  extensively  tuned  to  achieve  a  near-best  performance 
possible.  The  test  results  are  presented  in  Figure  1. 

Original  TVAL3  TwIST 


i  I 


SNR:  77.64dB,  CPU  time:  4.27s  SNR:  46.59dB,  CPU  time:  13.81s 

NESTA  A -Magic 


SNR:  34.1 8dB,  CPU  time:  24.35s  SNR:51.08dB,  CPU  time:  1558.29s 

Figure  1:  Recovery  of  a  64  x  64  phantom  image  (shown  in  the  top-left)  from  30%  noiseless  measure¬ 
ments.  Top-middle:  reconstructed  by  TVAL3.  Top-right:  reconstructed  by  TwIST.  Bottom- 
left:  reconstructed  by  NESTA.  Bottom-right:  reconstructed  by  Ai-Magic. 


From  Figure  1,  we  observe  that  TVAL3  achieved  the  highest  SNR  at  77.6dB,  while  taking  the 
shortest  running  time  (4.3  seconds).  The  second  highest  SNR  was  obtained  by  Ai-Magic  at  51.1dB 
at  the  cost  of  taking  an  unacceptable  amount  of  time  (1558.3  seconds).  TwIST  and  NESTA  attained 
relatively  medium-quality  images  (SNR  around  46.6dB  and  34.2dB  respectively)  within  reasonable 
running  times  (13.8  and  24.4  seconds,  respectively).  This  test  suggests  that  TVAL3  is  capable 
of  achieving  high  accuracy  within  an  affordable  running  time  on  the  tested  image  reconstruction 
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problems,  outperforming  some  state-of-the-art  solvers. 

The  second  test  is  much  more  challenging  with  a  larger  image  and  rather  high-level  Gaussian 
noise.  Specifically,  the  data  is  a  256  x  256  MR  brain  image  contained  rather  complicated  features. 
This  time,  the  measurement  matrix  A  is  a  column  permuted,  partial  Walsh  Hadamard  matrix  with 
only  10%  rows  selected  at  random.  To  the  observation  vector  b  =  Ax  we  added  Gaussian  noise  at 
the  level  of  10%  in  magnitude.  Specifically,  a  noisy  observation  is  synthesized  by  the  formula  (in 
the  Matlab  format) 


b  =  b  a  *  mean  (abs  (6))  *  randn  (m,  1),  (21) 

where  b  6  Mm  on  the  right-hand  side  is  the  noiseless  observation,  and  a  represents  the  noise  level. 

From  the  first  test  on  the  phantom  image,  we  know  that  G-Magic,  though  producing  good 
quality  solutions,  can  become  excessively  expensive  on  relatively  large-scale  problems.  For  this 
reason,  we  excluded  it  from  the  second  test.  In  Figure  2,  we  present  test  results  for  TVAL3,  TwIST 
and  NESTA  solving  ROF  model  (8).  We  observe  from  Figure  2  that  TVAL3  produced  the  best 
recovery  quality  with  the  shortest  amount  of  running  time,  TwIST  produced  the  poorest  recovery 
quality  with  the  longest  amount  of  running  time,  while  NESTA  is  in  the  middle  on  both  accounts. 
These  results  indicate  that  TVAL3  is  likely  to  be  more  efficient  and  more  robust  in  solving  certain 
highly  difficult  problems. 

Finally,  in  the  third  test  we  fix  the  Gaussian  noise  level  to  10%  and  repeat  the  experiment  for 
90  different  sampling  ratios  ranging  from  9%  to  98%  with  1%  increment.  All  the  parameters  are 
set  as  the  same  as  in  the  second  test.  The  results  are  plotted  in  Figure  3,  showing  the  recovery 
quality  and  running  time  for  TVAL3,  TwIST  and  NESTA.  Figure  3  indicates  that  on  these  test 
cases  TVAL3  always  achieves  the  best  quality  (highest  SNR)  with  the  shortest  running  time  among 
the  three  tested  solvers.  TwIST  and  NESTA  attain  similar  accuracy,  but  TwIST  is  much  slower 
especially  when  the  sampling  ratio  is  relatively  low.  These  facts  are  consistent  with  what  we  have 
discovered  from  Figure  2. 

4.2  A  Test  with  Hardware-Measured  Data 

To  see  the  performance  of  the  solvers  under  a  more  realistic  environment,  we  did  a  test  using  data 
collected  by  Rice’s  single-pixel  camera  [32],  Simply  speaking,  it  is  a  compressive  sensing  camera 
using  digital  micro- mirror  device  to  generate  measurements  (for  more  details  see  [32])  .  In  this  test, 
we  focused  on  reconstructing  infrared  data  captured  by  this  single-pixel  camera. 

As  is  shown  in  Figure  4,  a  canvas  board  with  two  letters  “IR”  written  on  it  by  charcoal  pencil  is 
entirely  covered  by  blue  oil  paint,  which  makes  the  letters  “IR”  invisible  to  human  eyes.  This  board 
was  illuminated  by  a  150-watt  halogen  lamp  and  measurements  were  gathered  by  the  single-pixel 
camera  equipped  with  an  infrared  sensor.  We  applied  TVAL3,  TwIST,  NESTA  and  G-Magic  to 
ROF  model  (8)  in  order  to  recover  the  image,  respectively  from  top  to  bottom  in  Figure  5,  where 
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Original 


TwIST 


SNR:  4.66dB,  CPU  time:  142.04s 


TVAL3 


SNR:  9.40dB,  CPU  time:  10.20s 


NESTA 


SNR:  8.03dB,  CPU  time:  29.42s 


Figure  2:  Recovery  of  a  256  x  256  MR  brain  image  (top-left)  from  10%  measurements  with  noise 
at  10%  level.  Top-right:  reconstructed  by  TVAL3.  Bottom-left:  reconstructed  by  TwIST. 
Bottom-right:  reconstructed  by  NESTA. 
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Figure  3:  Recoverability  for  256  x  256  MR  brain  image.  The  noise  level  is  10%.  Left:  average 
SNR.  Right:  average  running  time.  SNR  and  running  time  are  measured  simultaneously  with  the 
growth  of  the  sampling  ratio. 


the  recovered  images,  from  left  to  right,  are  corresponding  to  15%,  35%,  and  50%  sampling  ratios, 
respectively. 

In  this  test,  measurements  were  not  synthesized,  but  collected  from  hardware.  Hence,  there  is 
no  “ground-truth”  solution  available.  As  such,  recovery  quality  can  only  be  judged  by  subjective 
visual  examinations.  It  is  perhaps  agreeable  that,  in  Figure  5,  the  results  by  TwIST  (on  the  second 
row)  are  visually  inferior  to  others.  Another  unmistakable  observation  is  that  ^i-Magic  took  at 
least  10  times  longer  running  time  than  others,  while  the  others  required  much  less  running  times. 

In  short,  numerical  results  indicate  that  TVAL3  is  at  least  competitive  to  other  state-of-the-art 
TV  solvers  for  CS  reconstruction.  In  fact,  it  seems  to  be  more  efficient  and  more  robust  in  most 
tests  using  synthetic  data. 

5  Conclusions 

In  this  paper,  we  have  proposed,  analyzed  and  tested  an  algorithm  for  solving  non-smooth  uncon¬ 
strained  optimization  problems  with  a  structure  called  uneven  complexity  in  terms  of  minimizing 
the  objective  with  respect  to  two  groups  of  variables.  Such  a  structure  widely  exists  in  application 
problems  including  some  total-variation  minimization  problems  in  various  image  processing  tasks. 
The  proposed  algorithm  effectively  integrates  the  ideas  of  alternating  direction  and  nonmonotone 


15 


Figure  4:  Target  image  under  visible  light  with  letters  IR  covered  by  paint. 


line  search  to  take  advantages  of  both  techniques,  leading  to  relatively  low-cost  iterations  for  suit¬ 
ably  structured  problems. 

We  have  established  convergence  for  this  algorithm  by  extending  convergence  results  in  [37]  that 
are  applicable  only  to  smooth  objective  functions.  When  embedded  into  the  ALM  framework  as 
the  subproblem  solver,  the  proposed  approach  leads  to  efficient  ALM  implementations  for  solving 
targeted  equality-constrained  optimization  problems.  Based  on  this  approach,  a  TV  minimization 
solver  called  TVAL3  is  constructed.  Extensive  experiments  demonstrate  that  TVAL3  compares 
competitively,  and  often  favorably,  with  several  state-of-the-art  TV  solvers  in  the  field. 
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TVAL3 


15%  measurements 


CPU  time:  7.61s 


35%  measurements 


CPU  time:  6.79 


50%  measurements 


CPU  time:  6.04s 


TwIST 


CPU  time:  9.41s 


CPU  time:  9.66s 


CPU  time:  10.42s 


NESTA 


CPU  time:  8.31s 


CPU  time:  7.16s 


CPU  time:  6.75s 


U -Magic 


CPU  time:  194.63s 


CPU  time:  177.19s 


CPU  time:  158.51s 


Figure  5:  Recovery  of  a  256  x  256  infrared  image.  The  four  rows  (top  to  bottom)  are  reconstructed 
by  TVAL3,  TwIST,  NESTA  and  ^i-Magic,  respectively,  for  sampling  ratios  (left  to  right)  15%, 
35%,  and  50%,  respectively. 
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Appendix:  Proof  of  Theorem  1 

For  notational  simplicity,  let  us  define 

4>k(-)  =  (/>{-, Vk)  and  V </>*,(•)  =  <9i  <£(•,  Vk)-  (22) 

The  proof  of  the  theorem  relies  on  two  lemmas.  The  two  lemmas  are  modifications  of  their 
counterparts  in  [37].  Since  our  objective  may  contain  a  non-differentiable  part,  the  key  modifi¬ 
cation  is  to  connect  this  non-differentiable  part  to  the  differentiable  part  by  means  of  alternating 
minimization.  Otherwise,  the  line  of  proofs  follows  closely  that  given  in  [37]. 

The  first  lemma  presents  some  basic  properties  and  established  that  the  algorithm  is  well- 
defined. 

Lemma  1.  If  \7(fk(xk)Tdk  <  0  holds  for  each  k,  then  for  the  sequences  generated  by  Algorithm- 
NADA,  we  have  4>k{xk )  <  4>k- i(xk)  <  Ck  for  each  k  and  {Cfc}  is  monotonically  non-increasing. 
Moreover,  if  V 4>k(xk)T dk  <  0 ,  a  step  length  ak  >  0  always  exists  so  that  the  nonmonotone  Armijo 
condition  (11)  holds. 

Proof.  Define  real-valued  function 

=  tCk- 1  +  h-i(xk)  > 

t  + 1 

then 

D'k(t)  =  Ck~l  ~  for  t  >  0. 

ky  (t+  l)2 

Due  to  the  nonmonotone  Armijo  condition  (11)  and  ^  (fk{xk)T  dk  <  0,  we  have 

Cfc— 1  $k—l(.Xk)  —  1  {.Xk—  1 )  1  —  0. 

Therefore,  D'k(t )  >  0  holds  for  any  t  >  0,  and  then  D k  is  non-decreasing. 

Since 

Dk{  o)  =  4>k- i{xk)  and  Dk{iqk_iQk-i)  =  Ck, 

we  have 

<f>k-i(xk)  ft  Ckl  Vfe. 

As  is  defined  in  Algorithm-NADA, 


Vk  =  argmin  <j)(xk,y). 
y 


Therefore, 


4>(xk,yk )  <  4>{xk,yk-i). 


Hence,  c i>k{xk )  <  4>k-i(xk )  <  Cfc  holds  for  any  fc. 
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Furthermore. 


i.e., 


Ck+ 1  — 


( JlkQkC’k  “1“  0fc(*^fc+l))  ^  MkCk  T  Ck- 1-1 


Qk+1 


Qk+ 1 


{.VkQk  T  l)Cfe+l  —  {VkQk^k  T  Cfe+l)j 


i.e., 

Ck+i  fs  Cfc- 

Thus,  {Cfc}  is  monotonically  non-increasing. 

If  Ck  is  replaced  by  (f>k{xk)  in  (11),  the  nonmonotone  Armijo  condition  becomes  the  standard 
Armijo  condition.  It  is  well-known  that  ak  >  0  exists  for  the  standard  Armijo  condition  while 
\/4>k(xk)Tdk  <  0  and  <f>  is  bounded  below.  Since  4>k(xk)  <  Ck,  it  follows  ak  >  0  exists  as  well  for 
the  nonmonotone  Armijo  condition: 


4>k(xk  +  akdk )  <  Ck  +  5akS7(j)k(xk)Tdk. 


Now  we  defining  the  quantity  Ak  by 

1  ^ 

Ak  =  F+T  ^{xk)-  (23) 

2—0 

By  induction,  it  is  easy  to  show  that  Ck  is  bounded  above  by  Ak-  Together  with  the  facts  that 
Ck  is  also  bounded  below  by  (j>k(xk )  and  ak  >  0  always  exists,  it  is  clear  that  Algorithm-NADA  is 
well-defined.  □ 


In  the  next  lemma,  a  lower  bound  for  the  step  length  generated  by  Algorithm-NADA  will  be 
given. 


Lemma  2.  Assume  that  V 4>k(xk)T dk  <  0  for  all  k  and  that  Lipschitz  condition  (19)  holds  with 
constant  L.  Then 


oik  >  min 


^max  2(1  -  5)  \V<t>k(xk)Tdk\ \ 
p  ’  lp  ii4ii2  r 


(24) 


Proof.  It  is  noteworthy  that  p  >  1  is  required  in  Algorithm-NADA.  If  pak  >  ccmax)  then  the  lemma 
already  holds.  Otherwise, 

POik  —  OikP  k  <  Ckmax, 


which  indicates  that  6k  is  not  the  largest  integer  to  make  the  /c-th  step  length  less  than  amax- 
According  to  Algorithm-NADA,  4  must  be  the  largest  integer  satisfying  the  nonmonotone  Armijo 
condition  (11),  which  leads  to 


4>k(xk  +  pakdk )  >  Ck  +  5pakV(/)k(xk)Tdk. 


23 


Lemma  1  showed  C k  >  <4(xfc)>  so 

4>k(xk  +  potkdk )  >  4>k(xk)  +  5pakV4>k(xk)Tdk. 

On  the  other  hand,  for  a  >  0  we  have 

PCX 

/  {V(j>k(xk  +  tdk)  -  V(j)k(xk))  dk  dt  =  (j)k(xk  +  adk )  -  f)k{xk )  -  aV4(xfc)T4. 

Jo 

Together  with  the  Lipschitz  condition,  we  obtain 

PCX 

fk(xk  +  adk)  =  (t>k{xk)  +  aV 4>k{xk)T dk  +  I  (V4(xfc  +  tdk)  -  V^^.))  dk  dt 

Jo 

PCX 

<  4>k(xk)  +  aV (j)k{xk)T dk  +  /  tL\\dk\\2  dt 

Jo 

=  4>k(xk )  +  aS7  4>k{xk)T  dk  +  ^La2\\dk\\2. 


(25) 


Let  a  =  pak,  then 


4>k(xk  +  /oafc4)  <  0k(xk)  +  pakVMxkfdk  +  -Lp2a2k\\dk\\2 . 


(26) 


Comparing  (25)  to  (26),  we  deduce  that 


1 


(5-  l)y (t>k{xk)T dk  <  ^Lpak\\dk\\2. 


Since  V 4>k(xk)T 4  <  0, 


ak  > 


2(1-  6)\VMxk)Tdk\ 


Lp  II4II2 

Therefore,  the  step  length  ak  is  bounded  below  as  in  (24). 


□ 


With  the  aid  of  the  lower  bound  (24),  we  now  are  ready  to  prove  Theorem  1.  We  need  to 
establish  the  two  relationships  given  in  (20). 

Proof.  First,  by  definition  in  Algorithm-NADA, 

yk  =  argmincj b(xk,y). 
y 


Hence,  it  always  holds  true  that 


0  e  d2 <j>(xk,yk). 


Now  it  suffices  to  show  that  the  limit  holds  true  in  (20).  Consider  the  nonmonotone  Armijo 
condition: 


4>k(xk  +  akdk)  <Ck  +  5aky (fk{xk)T dk. 


(27) 
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If  pak  <  ctmax,  in  view  of  the  lower  bound  (24)  on  a k  in  Lemma  2  and  the  direction  assumption 

(18), 


4“  ®fc4)  —  Cfc  *5 

<  ck- 
=  ck- 


2(1  -  5)  \V(j>k(xk)T  dk\" 


Lp  ||4||2 
2*5(1  -<5)  c?||V4(xfc)||4 

Lp  ci||V4(.xfc)||2 
2<5(l-<5)cf 


Lpc^ 


II  vM*kW. 


On  the  other  hand,  if  pak  >  amax,  the  lower  bound  (24),  together  with  the  direction  assumption 
(18),  gives 

(j)k{xk  +  otkdk)  <  Cfc +  5afcV^fc(xfc)T4 
<  Ck  -  5akci\\\7 (t>k(xk)\\2 

5ftmax*-l  Hn  j  /  x  1 1 2 


<  ck- 


p 


■Iiv4(xfc)lr- 


Introducing  a  constant 


r  =  mm 


25(1  -  <5)c?  <5amaxci  1 


P 


S' 


we  can  combine  the  above  inequalities  into 

<t>k(xk  +  afe4)  <  Cfc  -  f  ||V4(xfc)||2. 


(28) 


Next  we  show  by  induction  that  for  all  k 


1 

1  Pmaxi 
Wk 


(29) 


which  obviously  holds  for  k  =  0  given  that  Q o  =  1-  Assume  that  (29)  holds  for  k  =  j.  Then 


Qh  1  =  + 1  <  T-Pti—  +  1  <  T^-  + 1  =  4— 

1  Pmax  1  Vmax  1  Pmax 


implying  that  (29)  also  holds  for  k  =  j  +  1.  Hence,  (29)  holds  for  all  k. 
It  follows  from  (28)  and  (29)  that 


Ck  Ck- i-i  — 


> 

> 


VkQkC'k  T  (t>k{-^k+ 1) 

Q w 

Ck(j]kQk  T  1)  {f]kQkCk  4“  (j)k{xk-\- 1)) 
Qk+ 1 

Ck  cj)k(xk-\.\) 

Qk+l 

f\\^A{xk)\\2 

Qk+ 1 

t(1  rymaa;)  || V0fc(xfc)  ||  . 


(30) 
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Since  0  is  bounded  below  by  assumption,  {Ck}  is  also  bounded  below.  In  addition,  by  Lemma  1, 
{Ck}  is  monotonically  non-increasing,  hence  convergent.  Therefore,  the  left-hand  side  of  (30)  tends 
to  zero,  so  does  the  right-hand  side;  i.e. ,  ||V<^(xfc)||  — >  0.  Finally,  by  definition  (22), 

lim  di 4>(xk,yk)  =  0, 

fc— o 

which  completes  the  proof.  □ 
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